arXiv:1508.00338vl [astro-ph.IM] 3 Aug 2015 


Mon. Not. R. Astron. Soc. OOP. [Tl fT^ (2015) Printed 4 August 2015 (MN style file v2.2) 


An autoencoder of stellar spectra and its application in 
automatically estimating atmospheric parameters 

Tan Yang, Xiangru LP 

School of Mathematical Sciences, South China Normal University, No. 55, West of Yat-sen Avenue, Guangzhou, 510631, China 


4 August 2015 


ABSTRACT 

This article investigates the problem of estimating stellar atmospheric parameters 
from spectra. Feature extraction is a key procedure in estimating stellar parameters 
automatically. We propose a scheme for spectral feature extraction and atmospheric 
parameter estimation using the following three procedures: firstly, learn a set of basic 
structure elements (BSE) from stellar spectra using an autoencoder; secondly, extract 
representative features from stellar spectra based on the learned BSEs through some 
procedures of convolution and pooling; thirdly, estimate stellar parameters (Tgff, log g, 
[Fe/H]) using a back-propagation (BP) network. The proposed scheme has been eval¬ 
uated on both real spectra from Sloan Digital Sky Survey (SDSS)/Sloan Extension for 
Galactic Understanding and Exploration (SEGUE) and synthetic spectra calculated 
from Kurucz’s new opacity distribution function (NEWODF) models. The best mean 
absolute errors (MAEs) are 0.0060 dex for log Teff, 0.1978 dex for log g and 0.1770 
dex for [Fe/H] for the real spectra and 0.0004 dex for log Tgff, 0.0145 dex for log g 
and 0.0070 dex for [Fe/H] for the synthetic spectra. 

Key words: methods: statistical-techniques: spectroscopic-stars: atmospheres-stars: 
fundamental parameters 


1 INTRODUCTION 


With the development of astronomical observation technol¬ 
ogy, more and more large-scale sky survey projects have been 
proposed and impleme nted: for exampl e , the Sloan Dig i- 
tal Sky Survey ISDSS: IVork et aDl200d : lAhn et all 1201^ . 
the Large Sky Area Multi-Object Fibre S pectroscopic Tele¬ 
scope (LAMOST/Guoshoujing T elescope: IZhao et al.ll200t] : 


ICui et al.l l2012l: ILuo et a.1 


vey ( Gilmore et al. I I2OI2I: 


|20lj) and the Gaia-ESO Sur- 


Randich et al. 1120131 1. Large sky 


survey projects can automatically observe and collect mass 
astronomical spectral data. However, the speed of manu¬ 
ally processing these massive, high-dimensional data can¬ 
not adapt to the rapid growth of data. Therefore, auto¬ 
matic processing and a nalysis methods ar e urgently needed 
for astronomical data dZhang et alllioo^ l. Because of this, 
automatic estimation of stellar atmo spheric parameters 
from spectra is currently a hot topic (IRe cio-Blanco et all 


200^ Fiorentin^t aiy igOO^^ ^ et al.ll2010l : Wu et al.ll201ll : 


Liu. Zhang fc Lu|l2014lL 

Estimating the stellar parameters (Teff, log g, [Fe/H]) 
from spectra can be abstracted to a process of finding a map¬ 
ping from a stellar spectrum to its parameters. If there is a 
set of representative stellar spectra with known parameters 
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(training data), a supervised machine learning method can 
learn the mapping from training data: for example nearest 
neighbor (NN), artificial neural networks (ANN) and sup¬ 
port vector machines (SVM). Based on the learned map¬ 
ping, we can estimate the atmospheric parameters for a new 
spectrum. 


An astronomical spectrum usually consists of thousands 
of fluxes and it is necessary to extract a small number 
of features to estimate stellar parameters accurately (Sec- 
tion l8.3ll . Orthogonal transforms such as wavelet transforms 
(WTs) and principal-component analysis (PCA) are popular 
choices for extracting features. They give a complete repre¬ 
sentation of data by a set of complete orthogonal bases. By 
selecting a small number of transform coefficients as fea¬ 
tures, dimensionality reduction is reached. The wavelet co- 
efficients of a spectrum are used to estimate stellar parame- 


ters (iManteiga et al.l l201Cll: Lu. Li fc Lill2012l: Lu et al.l 2()lT 


and c lassify spectra ( Guo. Xing fc Jiangl 20041 : Xing fc Gud 


I2OO6I) . As the most popular dimension-reduction method. 


sis (IWhitnevI 

1982 

: iBailer-Jones. Irwin & Von HioDell 

1998; 

Zhang et al.l 

2005 

: ISingh et al. 20061: Zhang et al.l 

2006; 

Rosalie er al.l |2010|). Many parameter-estimation schemes 


use the projections of data/spectra on some principal com¬ 
ponents directly as spectral features and have achieved 
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good results dZhang et aTI l2005l . l2006l : ISingh et alJ l2006l : 
iRosalie er al.ll201^ 

The motivation of our research is to explore the fea¬ 
sibility of predicting atmospheric parameters (Teff, log g, 
[Fe/H]) from a small number of descriptions computed from 
information within a local wavelength range of a spectrum. 
For convenience, this kind of description is referred to as 
‘local and sparse features’. This characteristic of local rep¬ 
resentation helps to trace back the potential spectral lines 
effective in estimation (See section [531l . 

This work proposes a scheme based on autoencoders, 
convolution and pooling techniques to extract local and 
sparse features. Autoencoder and convolution operations 
give a statistical non-orthogonal decomposition, which leads 
to a redundant and overcomplete representation of data. 
The ‘overcomplete’ representation (equation 1151) of a spec¬ 
trum has many more dimensions than the original spectrum 
(equation [T]). While complete representations based on or¬ 
thogonal transforms are mature and popular feature extrac- 
tion methods, the redundant, ‘ove rcomplete’ representations 
llOlshausenll200l| : iTeh et al]l2003ll have been advocated and 
used successfully in many fields. In literature, researches 
dYee et al.ll2003ll show that redundancy and overcomplete¬ 
ness help in computing some features in subsequent pro- 
cedures, with improved robustness in the presence of noise 
llSimoncelli et ai]ll992li. more comp actness and more inter- 
pretability i Mallat and Zhan^ligQd l. 

From the overcomplete representation, a sparse repre¬ 
sentation (equation [TSJ is computed using pooling and max¬ 
imization operations. The pooling and maximization opera¬ 
tions are actually a competition strategy. In this procedure, 
much redundancy is removed by the competitions between 
multiple redundant components. This competition helps to 
restrain the copies with more noises and a robust represen¬ 
tation is obtained. A typical advantage of this scheme is 
that it can express many suitable and meaningful structures 
in data in some applications. It is shown that this scheme 
does extract some meaningful local features lsection l5.3ll for 
automatically estimating atmospheric parameters (Section 

EJ. 


The rest of this article is organized as follows: Sec¬ 
tion E] describes the spectra used in this work. Section [3] 
presents the framework of the proposed scheme. Section |4] 
introduces the autoencoder network, the concept of basic 
structure elements (BSEs) and the BSE learning method 
using autoencoders. Sections [5] and [6] describe the pro¬ 
posed feature-extraction method based on BSEs and the 
estimating method back-propogation (BP) network, respec¬ 
tively. Section [7] investigates the optimization of the pro¬ 
posed scheme. Section |8] reports some experimental evalu¬ 
ations and discusses the rationality and robustness of the 
proposed scheme. Section [9] concludes this work. 


2 DATA SETS AND PREPROCESSING 

2.1 Real spectra 

In this article, we use two spectral sets to evaluate the 
propo sed scheme. 5000 stell ar spectra selected from SDSS- 
DR7 (lAbazaiian et al]|2009l l compose the real spectrum set. 
Each spectrum has 3000 fluxes in a logarithmic wavelength 


range [3.6000, 3.8999] with a sampling resolution of 0.0001. 
The ranges of the three parameters are [4163, 9685] K for 
Teff, [1.260, 4.994] dex for log g and [-3.437, 0.1820] dex for 
[Fe/H]. 

2.2 Synthetic spectra 

A set of 18 969 synthetic spectra is generated with Ku- 
ruc z’s new opacity distribu tion function (NEWODF) mod- 
elsj[Cagtelli_^^j^ruc ^ using the package SPECTRUM 
(iGrav fc CorbaIlvlll994l ') and 830 828 atomic and molecular 
lines. The solar atomic ab undances we used are derived from 
ICrevesse fc Sau\^ (Il998l l. The grids of the synthetic stellar 
spectra span the parameter ranges [4000, 9750] K in Teff 
(45 values, step size 100 K between 4000 and 75 00 K and 
250 K between 7750 and 9750 K), [1, 5] dex in log g (17 
values, step size 0.25 dex) and [-3.6, 0.3] dex in [Fe/H] (27 
values, step size 0.2 between -3.6 and -1 dex, 0.1 between -1 
and 0.3 dex). 

2.3 Preprocessing and data partitioning 

The proposed scheme is implemented based on ANN. Usu¬ 
ally an ANN requires that each input component has been 
normalized to eliminate impacts on input data resulting 
from range differences from flux to flux. Suppose that 

X = (*1, • • • ,xi)^ (1) 

is a spectrum. The normalized spectrum x is calculated by 


where ||a ;||2 = (5I]i=i In addition, to reduce the dy¬ 

namical range and in order better to represent the uncer- 
taint ies of spectral data, Teff is replaced by log Teff in both 
sets llFiorentin et al.l|2007l : iLi et al.l[2014h . 

The proposed scheme of this work belongs to the class 
of statistical learning methods. The fundamental idea is to 
discover the predictive relationships between stellar spectra 
and atmospheric parameters Teff, log g and [Fe/H] from em¬ 
pirical data, which constitutes a training set. At the same 
time, the performance of the predictive relationships dis¬ 
covered should also be evaluated objectively. Therefore, a 
separate, independent set of stellar spectra is needed for 
this evaluation, usually called a test set in machine learning. 
However, most learning methods tend to overfit the empir¬ 
ical data. In other words, statistical learning methods can 
unravel some alleged relationships from the training data 
that do not hold in general. In order to avoid overfitting, 
we need a third independent spectrum set for optimizing 
the parameters (Section ^ of the framework that need to 
be adjusted objectively when investigating the potential re¬ 
lationships. This third spectrum set and the reference par¬ 
ameters constitutes the validation set. 

Therefore, in each experiment, we split the total spec¬ 
trum samples into three subsets: a training set (60 percent), 
a validation set (20 percent) and a test set (20 percent). 
The training set is the carrier of knowledge and the pro¬ 
posed scheme should learn from this set. The validation set 
is a mentor/instructor of the proposed scheme that can inde¬ 
pendently and objectively give some advice to the learning 
process. The training set and the validation set are used 
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Table 1. Roles of the three data sets 


Data set 

Roles 

Training set 

(1) Generate spectral patches to construct 
a training set for an autoencoder (Subsec¬ 
tion [T2j| 

(2) Learn in optimizing the configuration 
(Section O 

(3) Train the BP network (Section [GJ 

Validation set 

Evaluate the performance of the learned 
spectral parameterization in optimizing 
the configuration (Section (TJ) 

Test set 

Evaluate the performance of the proposed 
scheme (Section [Sj 



Figure 1. A flowchart to show the order of procedures in the 
proposed scheme. The procedure ‘Learning BSE’ is only used in 
investigation/training. In application/test stage, we can extract 
features by convolution and pooling after normalizing spectra us¬ 
ing the learned BSE in training. 


in establishing a spectral parameterization, while the test 
set acts as a referee to evaluate the performance of the es¬ 
tablished spectral parameterization objectively. The roles of 
these three subsets are listed in Table [T] 


3 FRAMEWORK 

There are multiple procedures in our researches; a flowchart 
shown in Fig. [T] illustrates the end-to-end flow. 

Overall, our work can be divided into two stages: (1) a 
research stage and (2) an application stage. These two stages 
can be implemented automatically based on the flowchart in 
Fig.ni As shown in this flowchart, in research stage, we can 
obtain an optimized conhguration for the proposed scheme 
and a spectral parameterization by which we can map a 
spectrum approximately to its atmospheric parameters. In 
the application stage, we can compute the atmospheric par¬ 
ameters from a spectrum. More about optimization of the 
configuration is discussed in Section [3 

This work used multiple acronyms and notations. To 
facilitate readability, we summarize them in Table [2l 



Figure 2. A framework of an autoencoder. The number of output 
nodes is equal to that of input nodes. The learning objective of 
the network is to make the output hyyb(A close as possible to 
input X. 

4 LEARNING BSES USING AN 

AUTOENCODER 

4.1 An Autoencoder 

An autoencoder is a special kind of ANN, ini¬ 
tially pro posed as a data dim e nsiona lity reduction 
scheme llHinton fc Salakhutdino^ l2006l) and now 

is _ wid ely used in image analysis ilTan fc EswaranI 

1201(1: IShin et al.l 1201 ill and speec h proc essing 

dVishnubhotla . Fernandez fc RamabhadranI 1201(1 : 

iDeng et al.ll2013f L 

An autoencoder adopts the framework shown in Fig. 
[2] and is usually used to extract features by unsupervised 
learning. In an autoencoder, there is only one hidden layer 
and the number of nodes in the output layer is equal to that 
in the input layer. The output yi is an approximation of the 
corresponding input Xi, 

t/i « Si, i = 1, • ■ • , m. (3) 

The learning objective of an autoencoder is to find a set of 
weights, labeled with lines in Fig. [3 between the nodes in 
different layers based on a set of empirical data (a training 
set). In other words, the autoencoder tries to approximate 
an identity function on the training set. 

By setting the number of hidden nodes to be far smaller 
than that of input nodes, an autoencoder can achieve dimen¬ 
sion reduction. For example, when an autoencoder has 200 
input nodes and 20 hidden nodes (equivalently, m = 200 and 
n = 20 in Fig. [2J, the original 200-dimensional input could 
be ‘reconstructed’ approximately from the 20-dimensional 
output of the hidden layer. If we use the output of the hid¬ 
den layer as a representation of an input of the network, the 
autoencoder plays the role of a feature extractor. 

To introduce the implementation details of the autoen¬ 
coder, we utilize the notations in an online tutorial “UFLDL 
Tutorial” ([Andrew et al.l[201ol ~l. 

In the network of Fig. [2j let the layer labels be 1 for 
the input nodes, 2 for the hidden nodes and 3 for the out¬ 
put nodes and suppose WjP (I = 1,2) represents a weight 
between the ith node in the Zth layer and the j’th node in 
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Table 2. Acronyms and notations used in this work. AN: acronym or notation. 


AN 

Meaning 

First appearance 

AN 

Meaning 

First appearance 

aT (x) 

the output of the jth node in 

Section 14. ll 

ap 

an atmospheric parameter 

Section [71 

bf 

the Zth layer for an input x 



log Teff , log g or [Fe/H] 


the bias of the jth node in the 

Section 14. ll 

/3 

a regularized parameter in the 

Section 14.11 


(/ + l)th layer 



objective function J(-, •) 


BSE 

basic structure element 

Abstract 

DCP 

description by convolution and 
pooling 

Section 15.21 

/(■) 

an estimation method 

Section [6] 

ff(-) 

an activation function of an au¬ 
toencoder 

Section 14.11 

hw,h 

the mapping of an autoencoder 

Section 14. ll 


the objective function of an au¬ 
toencoder 

Section 14.11 

KL{. II .) 

relative entropy of the average 

Section 14. ll 

A 

a weight decay parameter in 

Section 14.11 


output of a hidden node and its 
expected output 



the objective function J(-, •) 


A 

optimized value for A. p, n, 

Section [7| 

m 

number of nodes in the input 

Section|4.1| 


Np, and are defined 

similarly 



layer of an autoencoder 


n 

number of nodes in the hidden 

Section 14.11 

„BP 

^hl 

number of hidden layers in a 

Section [§] 


layer of an autoencoder 


BP network 


^BP 

^hl 

initialized value of 

Section [7| 

^BP 

number of nodes in the hidden 
layers of a BP network 

Section [H] 

i^BP 

^nhl 

an initialization of 

nnl 

Section [7| 

N 

number of samples in a data set 

S or 5*^ depending on its con- 

Section 14.11 





text 


Np 

number of pools 

Section |5.2| 

P 

an desired activation level 

Section |4.1| 

Pj 

the average activation of the 

Section 14. ll 

RR\ 

restricted (search) range for 

Section [7| 


jth hidden node of an autoen- 



A. RRpi RRni RRjsf^, 



coder 



RR„bp and RR^bp are de- 

’‘‘hl '‘‘nhl 

fined similarly 


S 

a data set with spectra and at¬ 
mospheric parameters 

Section [3 

Sbse 

a set of BSEs 

Section 14.21 

Sir 

a training set of stellar spectra 

Section 14.21 

gtr.ae 

a training set for an autoen¬ 
coder, consists of some spectral 
patches 

Section 14.11 

the maximum convolution re- 

Section 15.21 

w 

description by convolution and 

Section 15.21 


sponse with the jth BSE in the 
gth pool 


wT 

pooling (DCP) 


the jth BSE 

Section |4.1| 

a weight between the ith node 

Section |4.1| 




in the Zth layer and the jth 
node in the (Z -|- l)th layer 



X 

a spectrum or a spectral patch 
for an autoencoder depending 
on its context 

Section 12.31 

X 

a normalized spectrum 

Section 12.31 

y 

output of an autoencoder 

Sectionl4.ll 

z 

a convolution response of a 
spectrum with BSEs 

Section 15. ll 


the {I + l)th layer, {I = 1 , 2 ) is the bias of the jth node 
in the {I + l)th layer and aT {x) {I = 1, 2, 3) is the output of 
the jth node in the ith layer for input x = (xi, ■ ■ • , Xm)'^■ 
Then, a[^\x) = Xi for the input nodes, Y^^x) = yt for the 
output nodes. 

For compact expressions in equations and , we 
introduce three variables si, S 2 and S 3 respectively repre¬ 
senting the number of nodes in three layers of an autoen¬ 
coder network (Fig. [2]). Then, si = m, S2 = n and S3 = m 
and the relationship between the nodes in different layers is 

= 5(E )’' = 1’ 2 . (4) 

i=l 


activation function. Two common choices for the activation 
function are a sigmoid function 

5 ( 2 ) = rr -^— (5) 

' ^ 1 -I- ^ ^ 

and a hyperbolic tangent function 


9{z) 



gZ _|_ g-z ■ 


( 6 ) 


This work uses the sigmoid function in equation ®. 

Overall, the network in Fig. [5] implements a non-linear 
mapping hw,b{-) from an input x — (xi,--- ,Xm)'^ to an 
output y = (yi, • • • 


In equation ([4]), the g{-) is referred to in the literature as an 


y = hw,b{x), 


(7) 
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where W = {Wff} represents the set of the weights of an 
autoencoder network and b = {bf^} the set of biases. 

Suppose that is a training set for an autoencoder. 

To obtain the parameters W and b for an autoencoder net¬ 
work based on equation we can minimize an objective 
function, J, in equation ® 

Jiw,b) = ^ E (^iiV6 (*)-®ii") 

x£Str.ae 

. 2 Si SZ + 1 S2 

+^EEE(EP)'+/^EKL(piiPf), (8) 

i=i i=i j=i 

where pj is the average output of the jth hidden node, 

= ^ E (9) 

jj3^5tr_ae 

KL(p II Pj) = plog^-b (1 - p)log-i—( 10) 
P 1 - Pj 

N is the number of samples in a training selQ 5'*''-“® and 
A is 0, P Is 0 and p > 0 are three preset parameters of 
the spectral parameterization. These three preset parame¬ 
ters control the relative importance of the three terms in 
equation ®. In the literature, the A is usually referred to 
as a weight decay parameter. 

In equation ®, the first term represents an empirical 
error evaluation between the actual output and the expected 
output of an autoencoder and ensures a good reconstruction 
performance of the network. The second term, a regulariza¬ 
tion term of \ is used to overcome possible overfitting 
to the training set by reducing the scheme’s complexity. 

The third term with weighted coefficient /3 is a penalty 
term for sparsity in outputs of the hidden layer. KL(p || pj) 
is the relative entropy of the average output, pj, and a de¬ 
sired activation level p. KL(p || pj) increases monotonically 
with increasing distance between pj and p and encourages 
the average activation, p, of the hidden layer to be close to 
a desired average activation p. 


4.2 Learning BSEs 

BSEs consist of a set of templates of spectral patches with 
a limited wavelength range. Using the BSE, we can extr¬ 
act local and sparse features (Section®. This work studies 
the feasibility of learning BSEs through an autoencoder for 
automatic estimation of atmospheric parameters. In an au¬ 
toencoder, the weighted sum YAiL\ in iii® 

jith hidden node is essentially a projection of an input x 
on the vector , ■ ■ ■ , of weights, 

which is similar to the coefficients of a vector in a coordinate 
system. Thus, can be regarded as a “basis” for the in¬ 
put data and in this article we name them a basic structure 
element (BSE), where j = 1, - ■ ■ ,n. 

To learn BSEs through an autoencoder, a training set 
was constructed to represent the local information of 
a stellar spectrum. Let 5"*” be a training set of stellar spectra 
(Section®. The BSE training set ig constructed from 

a spectral training set in the following way: 

^ The trjie is the abbreviation of a training set for an autoen¬ 
coder. 


(1) randomly select a spectrum, x — {xi,--- ,xi), from 
S*”, where I > 0 represents the number of fluxes of a spec¬ 
trum; 

(2) randomly generate an integer j satisfying 1 ^ j ^ 

n —m-l-1, where m > 0 represents the dimension of a sample 
in and is consistent with the number of input nodes 

of the autoencoder in Fig. ® 

(3) take {xj,Xj+i, ■ • ■ , Xj+m-i)'^ as a sample of 

By repeating the above three procedures, we obtain a 
BSE training set Therefore, actually consisits 

of a series of spectral patches. Considering the widths of lines 
of a stellar spectrum, m is empirically taken as 81 in this 
work. In the proposed scheme, 100 000 such patches are gen¬ 
erated to constitute the BSE training set. These patches are 
not generated from a specific wavelength position, therefore 
gtr.ae exp]-0ggeg the general structures of all spectral patches 
with length m. 

To learn a set of BSEs, we input the generated training 
set into an autoencoder (Section 14.11) and compute a 

set of spectral templates, BSEs: 

5bse = { 1 U(.'\---( 11) 

where every BSE, Wj^\ is a vector representing a basic pat¬ 
tern of spectral patches, 

luj') = • • • , (12) 

5 FEATURE EXTRACTION 

This work proposes to extract features by performing con¬ 
volution and pooling operations on the computed BSEs and 
a stellar spectrum. 


5.1 Convolution 


Let 

X = (*1, • • • ,xi)'^ (13) 


denote a spectrum. Using the extracted BSEs Sbse in equat¬ 
ion uni and a convolution operation, we filter the spectrum 

X'. 

m 

= (14) 

P=1 


and transform the spectrum x into 

.=((.W)",..., (.(")) Y 


-Iz™ ... 2 ( 1 ) 


■> ? 




where i = 1, • • • , / — m -h 1, j = 1, • • • , n. Then, 

Aj) _ (Aj) ~U) 

2 


(15) 


(16) 


is the convolution response vector of the jth BSE structure 
in equation dni- 

In this work, a BSE structure template has m = 81 com¬ 
ponents ('equation 1121) and a SDSS spectrum is represented 
with I — 3000 fluxes. Therefore, there are I — m + 1 — 2920 
convolution responses for any one BSE structure template 
and spectrum. From the n = 25 BSE structure templates 
in equation m, we obtain (Z — m -|- 1) x n = 2920 x 25 
convolution responses (in equation [IS) for every spectrum. 
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5.2 Pooling 


However, there are many redundancies in the convolution 
response description, z in equation di), of a spectrum for 
physical atmospheric parameter estimation. Redundancies 
usually result in overfitting. To overcome this problem, a 
‘pooling’ method is adopted to reduce dimension by merging 
the features of different positions. In this pooling method, 
we first equally partition the convolution response descrip¬ 
tion into Np pools according to their wavelength positions, 
then choose the maximum response in each pool as the final 
spectral feature: 

= max{2p\ ^ X (g - 1) -b 1 ^ i ^ ^ X q}, (17) 

where q — 1, - ■ ■ , Np,j = 1, • • • , n, is the length of every 
pool, I is the number of fluxes in a spectrum and Np is 
the number of pools. The Vg^'^ is referred to as a maximum 
convolution response. Thus, the spectrum x is transformed 
to 


/ (1) (1) (n 




(18) 


For convenience, the detected features in equation m are 
named description by convolution and pooling (DCP). For 
a SDSS spectrum, the dimension of its description decreases 
from (1 — m + 1) x n = 2920 x 25 (convolution responses) to 
Np X n = 10 X 25 (DCP), if the pool number Np is 10. 


5.3 Discussions on convolution and pooling 


Pooling is a form of non-linear downsampling operation, 
which combines the responses of feature detectors at nearby 
locations into some statistical variables. The pooling method 
was proposed in Hubei and Wiesel’s work on complex cells 
in the visual cortex dHubel fc WieseJ 1 19621 ) and is now 
used in a large number of applications: for example com- 
*uter vision and speech recogn ition. Theoretical analysis 
Boureau. Ponce fc LeCunI l20ld ) suggests that the pool¬ 
ing method works well when the features are sparse. In 
image- and speech-processing fields, convolution and pool¬ 
ing based on local bases has been proven to be an effec¬ 
tive feature-extraction method ( Nagiet ^1201 ll : IShin et ahl 
I 2 OIII : lAshish. Murat fc Anthonu 201,’j ). 

In theory, the convolution operation can reduce negative 
effects from noise and the pooling operation can restrain the 
negative influence from some imperfections (e.g. sky lines 
and cosmic ray removal residuals, residual calibration de¬ 
fects) by competing between multiple convolution responses 
of the extracted BSEs and a spectrum (the maximization 
procedure). Some effects of convolution and pooling are in¬ 
vestigated in the following part of this subsection and fur¬ 
ther discussed in Sections l8.3l and l8. 41 based on experimental 
results. 

Using SDSS training data ('Section l2.ll) and the scheme 
proposed in Section r4.2l we learned 25 BSE structures (Fig. 
[3| by setting the number of hidden nodes n = 25 in Fig. [2] 
and equation HU, where the optimality of n =25 are inves¬ 
tigated in section [71 It is shown that the BSE structure tem¬ 
plates characterize the local structure of spectra. Using the 
learned BSE structures and the convolution-pooling scheme 
described in this section, we can extract features from every 
spectrum. 


For example, Fig. 4(a) presents a spectrum from SDSS; 
there is a stitching error near 5580A. In Fig. 4(b)[ the curve 
shows the convolution response vector of the third BSE 
structure on the spectrum in Fig. 4(a)| the points la- 
beled with quadrangles are the pooling responses {vg ,q = 
1, • • ■ , Np} in equation nil). The results in Fig. |4(b)| and 


4(c) show that the responses of the spuriously strong signal 


of the stitching error are reduced. 

To investigate characteristics of the extracted features 
further using convolution and pooling, we calculate the sta¬ 
tistical histogram of the maximum convolution responses on 
5000 SDSS spectra ('Section l2.1ll . Every maximum convolu¬ 
tion response Vg^'^ has a specific wavelength position (Fig. 
|4(b)[ equation 1171) . The statistical histogram of the maxi¬ 
mum convolution responses is obtained by cumulating the 
number of pooling responses Vg^'^ of 5000 SDSS spectra at 
each wavelength position. Fig. 4(c) shows the statistical his¬ 
togram of the maximum convolution responses correspond¬ 
ing to the third BSE structure and more statistical 

histograms of the maximum convolution responses are pre¬ 
sented in Fig. (5) The results in Fig. [5] also show that the 
effects of stitching errors near 5580A in Fig. 4(a) are negli¬ 
gible in the pooling response. More experimental investiga¬ 
tions on the stitching error are conducted in Section [8.41 

The statistical histograms of the maximum convolution 
responses (Figs |4(c)] and [5| on 5000 SDSS spectra also show 
that the pooling responses are statistically sparse: only at a 
few wavelength positions are there non-zero responses. The 
sparseness helps to trace back the physical interpretation of 
the extracted features. 

That is to say, most of the cumulative responses are 
close to zero except a few wavelength patches with large 
cumulative responses. As can be seen in Fig. [4l these local 
maximum responses appear near the wavelength of evident 
fluctuation in a spectrum. In the pooling process, the re¬ 
sponses of these wavelengths are frequently taken as spec¬ 
tral features. Some of the new features characterize such 
local structures in spectra and they are local and sparse. 

Considering that a pooling response Vg^^ is obtained by 
a convolution calculation ('equations 1 171 and 1 1411 . each pool¬ 
ing feature is actually calculated from some spectral fluxes 
in a wavelength range. Therefore, we present the wavelength 
ranges in the second column of Table [3] for those features 
with large cumulative responses in Fig. O 

Although the features are not extracted by identifying 
some spectral lines, the results in Fig. [5] do show that the 
detected features are near some spectral lines potentially 
existing in a specific stellar spectrum (the third column of 
Table El) . 


6 ESTIMATING ATMOSPHERIC 
PARAMETERS 

As a typical non-linear learning method, ANN has been 


sphere parameters 

Bailer- Joned 2000l: Willemsen et alJ 

2 OO 3 I: lOrdonez et al.1 

2 OO 7 I. I 2 OO 8 I: Zhao. Luo & Zhao 

20081: 

Pan & Tul 2 OO 9 I: Manteiea et al.1 120101: Giridhar et al. 

I 2 OI 1 I) 

and spectrum classification (iGulati et al.lll994l:lHiDDel et al.1 


Schierscher fc PaunzerJl201ll l. For example. Manteiea et al.1 
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Figure 3. A visualization of BSEs when the hidden layer has 25 nodes. 
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Figure 4. Convolution responses of a spectrum and a histogram of the maximum convolution responses in a pooling process with pool 
number Np = 10. (a) A normalized SDSS spectrum, (b) Convolution responses of the spectrum shown in (a) and the third BSE shown 
in Fig.O In (b), nine vertical dashed lines are boundaries of each pool (the pooling is implemented in logarithmic wavelength space, so 
the sizes of multiple pools are unequal in a wavelength space) and ten quadrangles indicate the maximum convolution response in every 
pool, (c) The statistical histogram of the maximum convolution responses in the pooling process for the third BSE in Fig. [3] on 5000 
SDSS spectra fSection I2.1I1 . 


































8 Tan Yang, Xiangru Li 



1 

1 




BSE (24) 



1 

, 


. 

, 

BSE (16) 



i 



.. 


BSE (14) 


J 

) 

, 


. 


BSE (13) 



, . i 


i 

I .. 

. 

1 BSE (6) 


i 

.1 


,, . 


BSE (4) 



ij 

Li J-J 

1 

.. ,... 

1 

BSE (3) 



li I .It 1 il. »> -I 1 - i-JL..< I li_ I ■ - - - I ■ j::. 

4000 4500 5000 5500 6000 6500 7000 7500 


Figure 5. The cumulative histogram of the local maximum response position for the third, fourth, sixth, 13th, 14th, 16th, 24th BSE 
structures. The dashed lines indicate eight significant local maximum cumulative responses of pooling operation on 5000 SDSS spectra. 


Table 3. The wavelengths of the eight local-maximum cumula¬ 
tive responses and some potential lines near them. WPT: wave¬ 
length position of typical local-maximum cumulative response of 
a pooling operation on 5000 SDSS spectra, this position is rep¬ 
resented by a three-dimensional vector (a b c), where a, 6 , c 
are respectively the starting wavelength, central wavelength and 
ending wavelength and log^Q 6 = (logj^Q ^ P®" 

tential lines probably related to the description of the detected 
feature. Wavelength b is the position of local maximum cumula¬ 
tive and fiuxes from range [a c] determine the response on b in 
a convolution process. 


No. 

WPT(A) 

PL 

1 

(4000 4074 4150) 

Ca I, H (5, He I 

2 

(4199 4276 4356) 

Cal, H 7 

3 

(4233 4311 4391) 

Fell, Nall, Ol,FeI,CaIII 

4 

(4744 4831 4922) 

Fel, OIII, Nal, Nall, Oil, Fell 

5 

(5047 5141 5237) 

Fell, He I, Nal.Calll,, OIII, Fel 

6 

(5757 5864 5973) 

Nal,Fel,Oil 

7 

(6401 6520 6641) 

Ha,CaH 

8 

(7473 7612 7754) 

Fel, Fe 11,0III, Ol 


ll2010ll extracted spectral features using fast Fourier trans¬ 
forms (FFTs) and discrete wavelet transform (DWT) and 
estimated the parameters Teff, log g, [Fe/H] and [a/Fe] by 
an ANN with one h idden layer fro m FFT coefficients and 
wavelet coefficients. iGiridhar et al.l ll201lll studied how to 
estimate atmospheric parameters dire ctly from spectra by 
a BP neural network. |Pan fc Tul ll2009ll proposed to param¬ 
eterize a stellar spectrum using an ANN from a few PCA 
features. 

In this article, we use BP networks to learn the mapping 


from extracted DCP features (Section [5]) to stellar parame¬ 
ters Teff, log g and [Fe/H]. Training of BP networks is an 
iterative process and in each iteration the estimated errors 
are calculated on two sets: the training set and the valida¬ 
tion set. A BP network optimizes its parameters by mini¬ 
mizing the difference between the network’s output and the 
expected output (e.g., stellar atmospheric parameters) ac¬ 
cording to the estimated errors in the training set and this 
process will stop when the estimated errors on the validation 
set have no improvement in successive iteration steps. This 
can avoid overhtting. 

In a BP network, there are two preset parameters: the 
number of hidden layers and number of nodes in each hidden 
layer. For convenience, the former is denoted by , the 
latter In this work, we investigated the cases — 1 

and — 2 for computational feasibility. If = 1, 
is a positive integer. If = 2, is a two-dimensional 
row vector consisting of the numbers of nodes in the first 
and second hidden layers. 

Suppose that S = {(a:, y)} is a data set, where x repre¬ 
sents the information of a spectrum and y is an atmospheric 
parameter. In this work, the accuracy of a spectral param¬ 
eterization /(■) on a data set S is evaluated by the mean 
absolute error (MAE) 

MAE(/(-)) = ^ \f[x)-y\, (19) 

{x,y)es 

where N is the number of samples in S. 


7 OPTIMIZING THE CONFIGURATION 

The proposed scheme consists of four steps (Fig. [T] Sec¬ 
tion Ell. In the second step, ‘Learning BSE’, there are four 
parameters. A, /I, p and n, to be preset, where n denotes 
the number of nodes in the hidden layer of an autoencoder 
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(Fig. [2]). In the third step, ‘Extract features by convolution 
and pooling’, there is a preset parameter, Np, representing 
the number of pools (Section [5]). In the estimation method, 
BP network, there are two preset parameters and 
(Section [6]) . 

To optimize the configuration of the proposed scheme, 
the spectrum-parameterization scheme was estimated from 
a training set from SDSS. The performance of the estimated 
spectrum-parameterization scheme was evaluated on a vali¬ 
dation set from SDSS (Section [^. 

Therefore, the optimal configuration 

{\,P,p,h,Np,hhf ,h’fhi)ap (20) 

can be found by 

„„MAE(A,,0,p,n, (21) 

X,f3,p,n,Np,nBP,riBP 

where, ap —Teti, log g or [Fe/H] and MAE is the prediction 
error on the SDSS validation set. The spectral parameter¬ 
ization is learned from a SDSS training set with a specific 
configuration of A, /3, p, n, Np, , n^hi and ap. 

Because there is no any analytical expression for the 
objective function in equation m, we can obtain the op¬ 
tional configuration {X, P, p,n, Np,nf^f ,n^^i)ap in theory 
by repeating the four procedures of the proposed scheme 
(Fig. [H Section o with every possible configuration of 
A, /3, p, n, Np, n^f, and choosing the one with minimal 
MAE error as the optimal configuration. 

However, this theoretical optimization scheme is infea¬ 
sible as regards computational burden. Therefore, instead of 
obtaining an optimal configuration, we propose to find an 
excellent/acceptable configuration, a suboptimal solution. 

To find a suboptimal solution, we restrict the search 
ranges for A, /?, p, n and Np empirically, as follows: 

RRx ={0.001, 0.0023, 0.0036, 0.0049, 

( 22 ) 

0.0061, 0.0074, 0.0087, 0.01}, 

RRp ={0.005, 0.0114, 0.0179, 0.0243, 

(23) 

0.0307, 0.0371, 0.0436, 0.05}, 


Table 4. The suboptional configuration obtained for the pro¬ 
posed scheme (Fig. [T] Section |3}. A, /3, p and h are the optimized 
values of four parameters in the second step, ‘Learning BSE’ (Fig. 
[Tjl. where h denotes the number of nodes in the hidden layer (Fig. 
H. Np represents the number of pools (Section[5jl in the third step, 
‘Extract features by convolution and pooling’ (Fig. and 

'^nhl optimized values of two preset parameters in the 

estimation method, BP network (Section!^. 


Parameters 

A 

/3 

P 

h 

Np 

f,BP 

^hl 

fjBP 

’*'nhl 

log Teff 

0.0074 

3 

0.0050 

25 

10 

2 

(14,10 ) 

log g 

0.0087 

3 

0.114 

25 

12 

2 

(16,4) 

[Fe/H] 

0.0087 

3 

0.114 

25 

15 

2 

(22,6) 


where 

^^nBP = { 1 , 2 }. 

If = 1, then 

RR BP — RRnn- 

^nhl 

If = 2, then 

-R-R^bp = {(a, b)\a ^ RR nn 7 b e RR 

nn J" 7 

where 

RR„„ = {2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24} (28) 

represents possible numbers of nodes in a hidden layer 
of a network investigated in this work. The computed 
, finhi) presented in Table ID 

Note, however, that the restricted search ranges are se¬ 
lected empirically based on performance on validation set, 
the solutions obtained are not usually global minimum ones, 
but some acceptable values with a feasible computational 
burden. If we expand the restricted ranges, it is possible 
that a better configuration can be obtained at greater com¬ 
putational cost. 


= {15, 20, 25, 30, 35, 40}, (24) 


RRnp = {15, 12, 10, 8, 6, 5, 4}, (25) 


and P = 3, where RRx represents a restricted search range 
for A; the other symbols RRp, RRn and RRnp are defined 
similarly. The suboptimal solutions of A, p, n and Np can be 
found by optimizing 


min 

XeRRx,peRRp,neRRn,NpeRRNp 


MAE(A, /3, p, n, Np, Uhl, n„hi,ap) 


(26) 

based on the framework in Fig. [T] where and are 
initialized with fihi = 1 and finhi = 6. The fihi and finui are 
determined empirically based on considerations of balance 
between computational burden and estimate accuracy. The 
configurations obtained are presented in Table |4l 

Based on the configurations of A, /3, n and Np, the sub¬ 
optimal solutions of and are computed by 


min 

hi nhl 


MAE(A, A n, Np,nhf ,n^hi, ap). 


( 27 ) 


8 EXPERIMENTS AND DISCUSSION 

8.1 Performance on SDSS spectra 

From the training set and the validation set consisting of 
SDSS spectra (Section [2]), we obtain a spectral parameter¬ 
ization using the proposed scheme (Section |3l Fig. [TJ and 
the proposed optimization scheme (Section [7] Table HI. 

On a SDSS test set, the MAE errors of this spectral 
parameterization are 0.0060 dex for log Teff, 0.1978 dex 
for log g and 0.1770 dex for [Fe/H] (the row with in- 
dex 1 in Table [Hi. S imilarly, on real spectra from SDSS, 
iFiorentin et al.l |2003) investigated the stellar parameter es¬ 
timation problem using PCA and ANN and obtained accu¬ 
racies of 0.0126 dex for log Teff, 0.3644 dex for log g dex and 
0.1949 de x for [Fe/H] on a SPSS test set based on 50 PCA 
features. iLiu. Zhang fc Lul (l2014l ~l studied the spectrum- 
parameterization problem using least absolute shrinkage and 
selection operator (LASSO) algorithm and Support Vector 
Regression (SVR) method and the optimal MAE errors are 
0.0094 dex for log Te±±, 0.2672 dex for log g and 0.2728 dex 
for [Fe/H]. 
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Table 5. Performance of the proposed scheme on test data sets 


Index 

Data source 

log Teff (dex) log g (dex) 

[Fe/H] (dex) 

(a) Performance of the proposed scheme. More details are presented in Section l8.11 and 18.21 

1 

SDSS spectra 

0.0060 0.1978 

0.1770 

2 

synthetic spectra 

0.0004 0.0145 

0.0070 

(b) Rationality to delete some data components in application (Section 18.3D. 

3 

SDSS spectra 

0.0080 0.2994 

0.2163 

(c) Robustness to stitching error (Section I8.4|l. 

4 

SDSS spectra 

0.0063 0.2371 

0.1827 


8.2 Performance on synthetic spectra 


To investigate the effectiveness of the proposed scheme fur¬ 
ther, we also evaluated it on synthetic spectr a. The synthetic 
spectr a used in this work and those used in iFiorentin et al] 
(120071 ) are all calculated from Kurucz’s model. The synthetic 
data set is described in Section This experiment shares 
the same parameters as the experiment on SDSS data (Sec- 
tion l8.ip and the BP estimation is learned from the synthetic 
training set (Section [2]). 

On the synthetic test set, the MAE accuracies of the spec¬ 
tral parameterization learned from the synthetic training 
set are 0.0004 dex for log Teff, 0.0145 dex for log g, and 
0.0070 dex for [ Fe/H] (the row with index 2 in Table [^. In 
iFiorentin et al.l (l2007t) . the best consistencies on synthetic 
spectra are obtained based on 100 principal components 
and the MAEs are 0.0030 dex f or log Teff, 0.02 51 for log g 


and 0.0269 for [Fe/H](Table 1 in iFiorentin et a h ||2007t )). Us - 
ing the LASSO algorithm and SVR method, iLi et al.N2014l l 
reached MAE errors 0.0008 dex for log Teff, 0.0179 dex for 
log g and 0.0131 dex for [Fe /H]. On the synthetic s pectrum, 
the mean absolute errors in iManteiga et ahl (I 2 OI 0 I ') are 0.07 
dex for log g and 0.06 dex for [Fe/H]. 


8.3 Effective data components, unwanted 
influences and their balances 

The proposed scheme extracts features by throwing away 
some information. In this procedure, it is very probable that 
some useful (at least in theory) components are discarded. 
Is this positive or negative? 

Actually, besides the useful data components in spectra, 
there is also redundancy, and/or noise and pre-processing 
imperfections (e.g. sky lines and/cosmic ray removal resid¬ 
uals, residual calibration defects). Redundancy means that 
some duplication of some components in a system exists. 
Multiple components are probably usually different from 
each other regarding the amount of duplications. Therefore, 
redundancy can disturb the weights of different components, 
which usually results in an erroneous evaluation and reduces 
the quality of learning. Noise and pre-processing imperfec¬ 
tions can mask off the effects of some important spectral in¬ 
formation, e.g. weak lines. Therefore, in theory, it is possible 
that we can improve the estimates of atmospheric parame¬ 
ters by throwing away some data components. 

We conducted an experiment to investigate this theoret¬ 


ical possibility. In this experiment, we estimate atmospheric 
parameters by deleting procedures 2 and 3 (Fig. [!} from the 
experiments in Section [8T] That is to say, we estimated the 
atmospheric parameters using a BP network directly from 
a spectrum without throwing away any information in the 
pooling procedure and the BP estimation network shared 
the same configurations as in the experiment of Section 18.II 
The results are presented in Table [5] (b) (the row with in¬ 
dex 3). The results of the experiments with index 1 and 
3 in Table [S] do not show any evident evidence of losing 
any spectrum parameterization performance when throwing 
away some data components using the proposed scheme. 


8.4 Effects of band stitcbing errors on parameter 
estimation 

There is a significant oscillation in some SDSS spectra near 
5580 A (e.g. Fig. 4(a) I, which is caused by errors in stitching 
the red and blue bands. Figs 4(b)[ 4(c) and [S] show that 
the responses of stitching errors is evidently reduced in the 
procedures of convolution and pooling. 

We also performed one experiment to analyse the effects 
of these ‘noises’ on performances of the proposed scheme. In 
this experiment, the BP estimator is trained with features 
excluding the convolution responses related to stitching er¬ 
rors. For example, in SDSS data, the stitching error appears 
in the fifth pool in our experiments, so we removed the pool¬ 
ing response of the fifth pool for every BSE structure. The 
new MAEs of the three parameters in the SDSS test set are 
presented in Table [5], part (c). 

Actually, in the wavelength range containing stitching 
error, there are both useful components and disturbances 
from noise and stitching errors. However, experiments show 
that the useful components outperform the disturbances in 
the proposed scheme (the experiments with indexes 1 and 4 
in Table [S]). 


9 CONCLUSION 

In this work, we propose a novel scheme for spectral fea¬ 
ture extraction and stellar atmospheric parameter estima¬ 
tion. In the commonly used methods like PCA and Fourier 
transform, every feature is obtained by incorporating nearly 
all of the fluxes of a spectrum. Differently from these, the 
characteristics of our proposed scheme are localization and 
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sparseness of the extracted features. ‘Localization’ means 
that each extracted feature is calculated from some spectral 
fluxes within a local wavelength range (Fig. 4(b) I. ‘Sparse¬ 
ness’ says that the atmospheric parameters can be estimated 
using a small number of features (Fig. 4(c) I. The ‘localiza¬ 
tion’ and ‘sparseness’ signify that many data components are 
thrown away, especially the redundancies, weak informative 
components fSection l8.3ll . However, these weak informative 
components are easily corrupted by noise and pre-processing 
imperfections (Section 18.311 . It is shown that the proposed 
scheme has excellent performance in estimating atmospheric 
parameters (Section [53] Section [8.21 Table [S] (a) ). 
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